We consider the dynamics of Q-learning in two-player two-action games with a Boltzmann
exploration mechanism. For any non-zero exploration rate the dynamics is dissipative, which guarantees
that agent strategies converge to rest points that are generally different from the game's Nash
Equilibria (NE). We provide a comprehensive characterization of the rest point structure for
different games, and examine the sensitivity of this structure with respect to the noise due to exploration.
Our results indicate that for a class of games with multiple NE the asymptotic behavior of learning 
dynamics can undergo drastic changes at critical exploration rates. Furthermore, we demonstrate 
that for certain games with a single NE, it is possible to have additional rest points (not corre- 
sponding to any NE) that persist for a finite range of the exploration rates and disappear when the 
exploration rates of both players tend to zero. 

PACS numbers: 02.50.Le, 87.23.Cc, 87.23.Ge, 05.45.-a
I. INTRODUCTION

Reinforcement Learning (RL) [1] is a powerful framework
that allows an agent to behave near-optimally
through a trial and error exploration of the environment. 
Although originally developed for single agent settings, 
RL approaches have been extended to scenarios where 
multiple agents learn concurrently by interacting with 
each other. The main difficulty in multi-agent learning 
is that, due to mutual adaptation of agents, the station- 
arity condition of single-agent learning environment is 
violated. Instead, each agent learns in a time-varying
environment induced by the learning dynamics of other 
agents. Although in general multi-agent RL does not 
have any formal convergence guarantees (except in cer- 
tain settings), it is known to often work well in practice. 

Recently, a number of authors have addressed the issue
of multi-agent learning from the perspective of dynamical
systems. For instance, it has been noted
that for stateless Q-learning with Boltzmann action selection,
the dynamics of agent strategies can be described
by (bi-matrix) replicator equations from population biology,
with an additional term that accounts for the
exploration. A similar approach for analyzing learning
dynamics with the ε-greedy exploration mechanism^1 was
developed in [10].

Most existing approaches so far have focused on numerical
integration or simulation methods for understanding the
dynamical behavior of learning systems. Recently, Ref. [10]
provided a full categorization of ε-greedy Q-learning
dynamics in two-player two-action games using analytical
insights from hybrid dynamical systems. A similar
classification for Boltzmann Q-learning, however, is



^1 The ε-greedy Q-learning scheme selects the action with the highest
Q-value with probability (1 − ε) + ε/n and each other action with
probability ε/n, where n is the number of actions.



lacking. On the other hand, a growing body of recent 
neurophysiological studies indicate that Boltzmann-type 
softmax action selection might be a plausible mechanism 
for understanding decision making in primates. For in- 
stance, experiments with monkeys playing a competitive 
game indicate that their decision making is consistent 
with softmax value-based reinforcement learning [11], [13].
It has also been observed that in certain observational
learning tasks humans seem to follow a softmax reinforcement
learning scheme [15]. Thus, understanding softmax
learning dynamics and its possible spectrum of behaviors
is important both conceptually and for making concrete
predictions about different learning outcomes.

Here we use analytical techniques to provide a complete
characterization of Boltzmann Q-learning in two-player
two-action games, in terms of their convergence
properties and rest point structure. In particular, it is 
shown that for any finite (non-zero) exploration rate, 
the learning dynamics necessarily converges to an inte- 
rior rest point. This seems to be in contrast with pre- 
vious observation IJ), where we believe the authors 
have confused slow convergence with limit cycles. Fur- 
thermore, none of the studies so far have systematically 
examined the impact of exploration, i.e., noise, on the 
learning dynamics and its asymptotic behavior. On the 
other hand, noise is believed to be an inherent aspect of 
learning in humans and animals, either due to softmax 
selection mechanisms [15], or random perturbations in
agent utilities [16]. Here we provide such an analysis,
and show that depending on the game, there can be one, 
two, or three rest points, with a bifurcation between dif- 
ferent rest-point structures as one varies the exploration 
rate. In particular, there is a critical exploration rate 
above which there remains only one rest point, which is 
globally stable. 

The rest of this paper is organized as follows: We next
describe the connection between Boltzmann Q-learning
and replicator dynamics, and elaborate on the non-conservative
nature of the dynamics for any finite exploration
rate. In Section III we analyze the asymptotic behavior of
the learning dynamics as a function of exploration rates
for different game types. In Section IV we illustrate our
findings on several examples. We provide some concluding
remarks in Section V.

II. DYNAMICS OF Q-LEARNING 

Here we provide a brief review of the Q-learning algorithm
and its connection with the replicator dynamics.

A. Single Agent Learning 

In Reinforcement Learning (RL) [1] agents learn to behave
near-optimally through repeated interactions with
the environment. At each step of interaction with the 
environment, the agent chooses an action based on the 
current state of the environment, and receives a scalar 
reinforcement signal, or a reward, for that action. The 
agent's overall goal is to learn to act in a way that will 
increase the long-term cumulative reward. 

Among many different implementations of the above
adaptation mechanisms, here we consider the so-called
Q-learning [17], where the agents' strategies are parameterized
through Q-functions that characterize the relative
utility of a particular action. Those Q-functions are updated
during the course of the agent's interaction with
the environment, so that actions that yield high rewards
are reinforced. To be more specific, assume that the agent
has a finite number of available actions, i = 1, 2, ..., n, and
let $Q_i(t)$ denote the Q-value of the corresponding action
at time t. Then, after selecting action i at time t, the
corresponding Q-value is updated according to

$$ Q_i(t+1) = Q_i(t) + \alpha \left[ r_i(t) - Q_i(t) \right] \tag{1} $$

where $r_i(t)$ is the observed reward for action i at time t,
and α is the learning rate.

Next, we need to specify how the agent selects actions.
Greedy selection, when the action with the highest Q-value
is selected, might generally lead to globally suboptimal
solutions. Thus, one needs to incorporate some way
of exploring less-optimal strategies. Here we focus on the
Boltzmann action selection mechanism, where the probability
$x_i$ of selecting the action i is given by

$$ x_i(t) = \frac{e^{Q_i(t)/T}}{\sum_k e^{Q_k(t)/T}}, \qquad i = 1, 2, \dots, n, \tag{2} $$

where the temperature T > 0 controls the exploration/exploitation
tradeoff: for T → 0 the agent always
acts greedily and chooses the strategy corresponding to
the maximum Q-value (pure exploitation), whereas for
T → ∞ the agent's strategy is completely random (pure
exploration).
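The update rule of Eq. (1) and the selection rule of Eq. (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the deterministic reward per action, and the default learning rate and step count are hypothetical choices, not part of the original analysis.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Eq. (2): Boltzmann (softmax) selection probabilities for T > 0."""
    q_max = max(q_values)  # shift by the maximum to avoid overflow
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def q_update(q_values, action, reward, alpha):
    """Eq. (1): move the Q-value of the selected action toward the reward."""
    updated = list(q_values)
    updated[action] += alpha * (reward - updated[action])
    return updated


def run_single_agent(rewards, temperature, alpha=0.1, steps=5000, seed=0):
    """Hypothetical driver: action i always pays the fixed reward rewards[i]."""
    rng = random.Random(seed)
    q_values = [0.0] * len(rewards)
    for _ in range(steps):
        probs = boltzmann_probs(q_values, temperature)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        q_values = q_update(q_values, action, rewards[action], alpha)
    return q_values, boltzmann_probs(q_values, temperature)
```

With fixed rewards the Q-values approach the rewards themselves, so the final strategy approaches the softmax of the rewards at temperature T rather than a pure strategy.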

We are interested in the continuous time limit of the
above learning scheme. Toward this end, we divide the
time into intervals δt, replace t + 1 with t + δt and α
with αδt. Next, we assume that within each interval
δt, the agent samples his actions, calculates the average
reward for action i, and applies Eq. (1) at the end of
each interval to update the Q-values.^2

In the continuous time limit δt → 0, one obtains the
following differential equation describing the evolution of
the Q-values:

$$ \dot{Q}_i(t) = \alpha \left[ r_i(t) - Q_i(t) \right] \tag{3} $$

Next, we would like to express the dynamics in terms
of strategies rather than the Q-values. Toward this end,
we differentiate Eq. (2) with respect to time and use Eq. (3).
After rescaling the time, t → αt/T, we arrive at the
following set of equations:

$$ \frac{\dot{x}_i}{x_i} = \Big[ r_i - \sum_k x_k r_k \Big] - T \sum_k x_k \ln\frac{x_i}{x_k} \tag{4} $$

The first term in Eq. (4) asserts that the probability of
taking action i increases with a rate proportional to the
overall efficiency of that strategy, while the second term
describes the agent's tendency to randomize over possible
actions. The steady state strategy profile, $x_i^s$, if it exists,
can be found from equating the right hand side to zero,
which can be shown to yield

$$ x_i^s = \frac{e^{r_i/T}}{\sum_k e^{r_k/T}} \tag{5} $$

We would like to emphasize that $x_i^s$ corresponds to the
so-called Gibbs distribution for a statistical-mechanical
system with energy $-r_i$ at temperature T. Indeed, it can
be shown that the above replicator dynamics minimizes
the following function resembling free energy:

$$ F[x] = -\sum_k r_k x_k + T \sum_k x_k \ln x_k \tag{6} $$

where we have denoted $x = (x_1, \dots, x_n)$, $\sum_{i=1}^{n} x_i = 1$.
Note that minimizing the first term is equivalent to
maximizing the expected reward, whereas minimizing the
second term means maximizing the entropy of the agent
strategy. The relative importance of those terms is regulated
by the choice of the temperature T. We note that
recently a free energy minimization principle has been
suggested as a framework for modeling perception and
learning (see [18] for a review of the approach and its
relation to several other neurobiological theories).
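The relation between the single-agent dynamics, the Gibbs distribution and the free-energy-like function can be checked numerically. The sketch below integrates the strategy dynamics for fixed rewards with a forward-Euler scheme and records the function of Eq. (6) along the trajectory. The step size, the number of steps and all function names are assumptions made for illustration.

```python
import math


def gibbs(rewards, temperature):
    """Gibbs distribution: x_i proportional to exp(r_i / T)."""
    r_max = max(rewards)
    weights = [math.exp((r - r_max) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def free_energy(x, rewards, temperature):
    """Eq. (6): negative expected reward minus T times the entropy."""
    return sum(-r * xi + temperature * xi * math.log(xi)
               for r, xi in zip(rewards, x))


def replicator_step(x, rewards, temperature, dt):
    """One Euler step of Eq. (4), written with phi_i = r_i - T ln x_i."""
    phi = [r - temperature * math.log(xi) for r, xi in zip(rewards, x)]
    mean_phi = sum(xi * p for xi, p in zip(x, phi))
    return [xi + dt * xi * (p - mean_phi) for xi, p in zip(x, phi)]


def integrate(x0, rewards, temperature, dt=0.01, steps=5000):
    """Return the final strategy and the free energy along the trajectory."""
    x = list(x0)
    energies = [free_energy(x, rewards, temperature)]
    for _ in range(steps):
        x = replicator_step(x, rewards, temperature, dt)
        energies.append(free_energy(x, rewards, temperature))
    return x, energies
```

Rewriting the bracket of Eq. (4) as phi_i minus its average is an algebraic identity (the sum of the x_k is one), and it makes the fixed-point condition, phi_i equal for all i, easy to read off.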



^2 In the terminology of reinforcement learning, this corresponds to
off-policy learning, as opposed to on-policy learning, where
one uses Eq. (2) and Eq. (1) concurrently to sample actions and
update the Q-values of those actions, respectively.
A potential issue with the latter scheme is that actions that are
played rarely will be updated rarely, which might be problematic
for the convergence of the algorithm. A possible remedy is to
normalize each update amount by the frequency of the corresponding
action, which can be shown to lead to the same dynamics,
Eq. (3), in the continuous time limit.






B. Two-agent learning 

Let us now assume there are two agents that are learning
concurrently, so that the rewards received by the
agents depend on their joint action. The generalization
to this case is introduced via game-theoretical ideas [13].
More specifically, let A and B be the two payoff matrices:
$a_{ij}$ ($b_{ij}$) is the reward of the first (second) agent when he
selects i and the second (first) agent selects j. Furthermore,
let $y = (y_1, \dots, y_n)$, $\sum_{i=1}^{n} y_i = 1$, be the strategy
of the second agent. The expected rewards of the agents
for selecting action i are as follows:

$$ r_i^x = \sum_j a_{ij} y_j, \qquad r_i^y = \sum_j b_{ij} x_j \tag{7} $$

The learning dynamics in the two-agent scenario is obtained
from Eq. (4) by replacing $r_i$ with $r_i^x$ and $r_i^y$ for the
first and second agents, respectively, which yields



$$ \dot{x}_i = x_i \Big[ (Ay)_i - x \cdot Ay + T_x \sum_j x_j \ln(x_j/x_i) \Big] \tag{8} $$

$$ \dot{y}_i = y_i \Big[ (Bx)_i - y \cdot Bx + T_y \sum_j y_j \ln(y_j/y_i) \Big] \tag{9} $$

where $(Ay)_i$ is the i-th element of the vector Ay, and we assume
that the exploration rates Tx and Ty of the agents
can generally be different. This system (without the exploration
term) is known as the bi-matrix replicator equation
[20], [21]. Its relation to multi-agent learning has
also been examined previously.

Before proceeding further, we elaborate on the connection
between the rest points of the replicator system
Eqs. (8), (9) and the game-theoretic notion of Nash Equilibrium
(NE), which is a central concept in game theory.
Recall that a joint strategy profile (x*,y*) is called NE 
if no agent can increase his expected reward by unilater- 
ally deviating from the equilibrium. It is known that for 
Tx = Ty = 0, all the NE of a game are also rest-points 
of the dynamics [2^ . The opposite is not true - not all 
the rest points correspond to NE. Furthermore, some NE 
might correspond to unstable rest points of the dynamics, 
which means that they cannot be achieved by the learn- 
ing process. For any finite Tx,Ty > 0, the rest points 
will be generally different from the NE of the game. In 
the limit Tx, Ty → ∞, agents are insensitive to the rewards
and mix uniformly over the actions. In this work
we study the behavior of the learning dynamics in the 
intermediate range of exploration rates. 
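As a rough numerical companion to Eqs. (7)-(9), the sketch below evaluates the right-hand sides of the two-agent dynamics and integrates them with a forward-Euler scheme. The function names, the step size and the number of steps are illustrative assumptions. The payoff matrix of the second agent is indexed as [own action][action of the first agent], following the definition of $b_{ij}$ in the text.

```python
import math


def expected_rewards(payoff, opponent_strategy):
    """Eq. (7): expected reward of each own action against the opponent."""
    return [sum(p * s for p, s in zip(row, opponent_strategy)) for row in payoff]


def strategy_velocity(strategy, rewards, temperature):
    """Right-hand side of Eq. (8) (or Eq. (9)) for one agent."""
    mean_reward = sum(s * r for s, r in zip(strategy, rewards))
    velocity = []
    for s_i, r_i in zip(strategy, rewards):
        exploration = sum(s_j * math.log(s_j / s_i) for s_j in strategy)
        velocity.append(s_i * (r_i - mean_reward + temperature * exploration))
    return velocity


def simulate(x, y, payoff_a, payoff_b, t_x, t_y, dt=0.01, steps=6000):
    """Forward-Euler integration of Eqs. (8)-(9); dt and steps are assumptions."""
    x, y = list(x), list(y)
    for _ in range(steps):
        dx = strategy_velocity(x, expected_rewards(payoff_a, y), t_x)
        dy = strategy_velocity(y, expected_rewards(payoff_b, x), t_y)
        x = [xi + dt * vi for xi, vi in zip(x, dx)]
        y = [yi + dt * vi for yi, vi in zip(y, dy)]
    return x, y
```

For a zero-sum game such as Matching Pennies with equal positive temperatures, a trajectory started in the interior spirals into the uniform strategy, in line with the dissipation argument of the next subsection.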

C. Exploration causes dissipation 

It is known that for Tx = Ty = 0 the system of
Eqs. (8), (9) is conservative, so that the total phase
space volume is preserved. It can be shown, however,
that any finite exploration rate Tx, Ty > 0 makes the
system dissipative, or volume contracting. While this
fact might not be crucial in high-dimensional dynamical
systems, its implications for low-dimensional systems, and
specifically for the two-dimensional dynamical system
considered here, are crucial. Namely, the finite dissipation
rate means that the system cannot have any limit cycles,
and the only possible asymptotic behavior is convergence
to a rest point. Furthermore, in situations when
there is only one interior rest point, it is guaranteed to
be globally stable.

To demonstrate the dissipative nature of the system
for Tx, Ty > 0, it is useful to make the following transformation
of variables

$$ u_k = \ln\frac{x_{k+1}}{x_1}, \qquad v_k = \ln\frac{y_{k+1}}{y_1}, \qquad k = 1, 2, \dots, n-1. \tag{10} $$



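The replicator system in the modified variables reads

$$ \dot{u}_k = \frac{\sum_{j=0}^{n-1} \tilde{a}_{kj} e^{v_j}}{1 + \sum_{j=1}^{n-1} e^{v_j}} - T_x u_k, \qquad \dot{v}_k = \frac{\sum_{j=0}^{n-1} \tilde{b}_{kj} e^{u_j}}{1 + \sum_{j=1}^{n-1} e^{u_j}} - T_y v_k \tag{11} $$

where

$$ \tilde{a}_{kj} = a_{k+1,j+1} - a_{1,j+1}, \qquad \tilde{b}_{kj} = b_{k+1,j+1} - b_{1,j+1} \tag{12} $$

and the j = 0 terms use the convention $u_0 = v_0 = 0$.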

Let us recall the Liouville formula: if $\dot{z} = F(z)$ is defined
on an open set U in $\mathbb{R}^n$, and $G \subset U$ is a set of points
whose image under the flow, $G(t) = \{ z(t) : z(0) \in G \}$, has
volume V(t), then the rate of change of the volume V is
determined by the divergence of F,
$\dot{V} = \int_{G(t)} \nabla \cdot F \, dz$. Consulting
Eq. (11), we observe that the dissipation rate is
given by

$$ \sum_k \left[ \frac{\partial \dot{u}_k}{\partial u_k} + \frac{\partial \dot{v}_k}{\partial v_k} \right] = -(T_x + T_y)(n - 1) < 0 \tag{13} $$



As we mentioned above, the dissipative nature of the dy- 
namics has important implications for two-action games 
that we consider next. 
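The constant divergence of Eq. (13) can be verified numerically. The sketch below computes the time derivatives of the log-ratio variables of Eq. (10) directly from Eqs. (8)-(9), and then estimates the divergence by central differences. The function names and the finite-difference step are assumptions for illustration.

```python
import math


def softmax_from_log_ratios(log_ratios):
    """Invert Eq. (10): recover a strategy from u_k = ln(x_{k+1} / x_1)."""
    weights = [1.0] + [math.exp(u) for u in log_ratios]
    total = sum(weights)
    return [w / total for w in weights]


def log_ratio_velocity(u, v, payoff_a, payoff_b, t_x, t_y):
    """Time derivatives of (u, v) implied by Eqs. (8)-(10)."""
    x = softmax_from_log_ratios(u)
    y = softmax_from_log_ratios(v)
    ay = [sum(a * yj for a, yj in zip(row, y)) for row in payoff_a]
    bx = [sum(b * xj for b, xj in zip(row, x)) for row in payoff_b]
    du = [ay[k + 1] - ay[0] - t_x * u[k] for k in range(len(u))]
    dv = [bx[k + 1] - bx[0] - t_y * v[k] for k in range(len(v))]
    return du, dv


def divergence(u, v, payoff_a, payoff_b, t_x, t_y, h=1e-5):
    """Central-difference estimate of the left-hand side of Eq. (13)."""
    def shifted(values, k, delta):
        out = list(values)
        out[k] += delta
        return out

    args = (payoff_a, payoff_b, t_x, t_y)
    total = 0.0
    for k in range(len(u)):
        plus = log_ratio_velocity(shifted(u, k, h), v, *args)[0][k]
        minus = log_ratio_velocity(shifted(u, k, -h), v, *args)[0][k]
        total += (plus - minus) / (2.0 * h)
    for k in range(len(v)):
        plus = log_ratio_velocity(u, shifted(v, k, h), *args)[1][k]
        minus = log_ratio_velocity(u, shifted(v, k, -h), *args)[1][k]
        total += (plus - minus) / (2.0 * h)
    return total
```

The payoff-dependent part of each derivative depends only on the other agent's variables, which is why the estimate is independent of the payoffs and of the point where it is evaluated.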



D. Two-action games

Let us consider two-action games, and let x and y
denote the probability of selecting the first action by the
first and second agents, respectively. Then the learning
dynamics Eqs. (8), (9) attain the following form:

$$ \frac{\dot{x}}{x(1-x)} = (a y + b) - \ln\frac{x}{1-x} \tag{14} $$

$$ \frac{\dot{y}}{y(1-y)} = (c x + d) - \ln\frac{y}{1-y} \tag{15} $$

where we have introduced

$$ a = \frac{a_{11} + a_{22} - a_{12} - a_{21}}{T_x}, \qquad b = \frac{a_{12} - a_{22}}{T_x} \tag{16} $$

$$ c = \frac{b_{11} + b_{22} - b_{12} - b_{21}}{T_y}, \qquad d = \frac{b_{12} - b_{22}}{T_y} \tag{17} $$
The vertices of the simplex, {x, y} = {0, 1}, are rest
points of the dynamics. For any Tx, Ty > 0, those rest
points can be shown to be unstable. This means that
any trajectory that starts in the interior of the simplex,
0 < x, y < 1, will asymptotically converge to an interior
rest point. The position of those rest points is found by
nullifying the RHS of Eqs. (14), (15). In the remainder of
this paper, we will examine the interior rest point equations
in detail.
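A minimal numerical sketch of the two-action dynamics follows. The coefficients are written as they follow from Eqs. (8)-(9) with x and y the probabilities of the first action, i.e. a is proportional to $a_{11} + a_{22} - a_{12} - a_{21}$. The Euler step size, the number of steps, the function names and the choice of the coordination game from the examples section as test case are assumptions made for illustration.

```python
import math


def reduced_parameters(payoff_a, payoff_b, t_x, t_y):
    """Coefficients a, b, c, d of Eqs. (14)-(15), derived from Eqs. (8)-(9)."""
    a = (payoff_a[0][0] + payoff_a[1][1] - payoff_a[0][1] - payoff_a[1][0]) / t_x
    b = (payoff_a[0][1] - payoff_a[1][1]) / t_x
    c = (payoff_b[0][0] + payoff_b[1][1] - payoff_b[0][1] - payoff_b[1][0]) / t_y
    d = (payoff_b[0][1] - payoff_b[1][1]) / t_y
    return a, b, c, d


def logit(p):
    """ln(p / (1 - p)), the entropic term of Eqs. (14)-(15)."""
    return math.log(p / (1.0 - p))


def integrate_two_action(x, y, a, b, c, d, dt=0.01, steps=5000):
    """Forward-Euler integration of Eqs. (14)-(15) from an interior point."""
    for _ in range(steps):
        dx = x * (1.0 - x) * (a * y + b - logit(x))
        dy = y * (1.0 - y) * (c * x + d - logit(y))
        x, y = x + dt * dx, y + dt * dy
    return x, y
```

For the coordination game with payoff matrix [[6, 0], [3, 2]] for both agents, trajectories at T = 0.5 end up at different rest points depending on the starting point, while at T = 1 all tested starting points reach the same rest point.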



III. ANALYSIS OF INTERIOR REST POINTS 

A. Symmetric Equilibria

First, we consider the case of symmetric equilibria, x =
y and Tx = Ty = T, in which case the interior rest point
equation is

$$ a x + b = \ln\frac{x}{1-x} \tag{18} $$

A graphical representation of Eq. (18) is shown in Fig. 1,
where we plot both sides of the equation as a function of
x. First of all, note that the RHS of Eq. (18) is a monotonically
increasing function, assuming values in (−∞, ∞) as
x changes between (0, 1). Thus, it is always guaranteed
to have at least one solution. Further inspection shows
that the number of possible rest points depends on the
type of the game as well as the temperature T.




FIG. 1: The graphical illustration of the rest point equation
for the symmetric case, Eq. (18). The solid curve corresponds
to the RHS, and the three lines correspond to the LHS for
subcritical, critical and supercritical temperature values,
respectively.

For instance, there is a single solution whenever a < 0, 
for which the LHS is a non-increasing function of x. 
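The number of solutions of the symmetric rest point equation can be counted numerically. The sketch below rewrites the equation in the variable u = ln(x/(1 − x)), so that rest points very close to 0 or 1 are resolved, and counts sign changes of the residual on a uniform grid. The grid parameters and function names are assumptions, and the search window must satisfy u_max > |a| + |b| to contain all roots.

```python
import math


def rest_point_residual(u, a, b):
    """LHS minus RHS of Eq. (18), written in terms of u = ln(x / (1 - x))."""
    x = 1.0 / (1.0 + math.exp(-u))
    return a * x + b - u


def count_symmetric_rest_points(a, b, u_max=30.0, grid=6000):
    """Count sign changes of the residual on a uniform grid in u."""
    count = 0
    step = 2.0 * u_max / grid
    previous = rest_point_residual(-u_max, a, b)
    for i in range(1, grid + 1):
        current = rest_point_residual(-u_max + i * step, a, b)
        if (previous < 0.0) != (current < 0.0):
            count += 1
        previous = current
    return count
```

For example, a = 10, b = −4 gives three sign changes, while a = 5, b = −2 and any tested case with a < 0 give a single one.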

Next, we examine the condition for having more than
one rest point, which is possible when a > 0 (see Fig. 1).
For sufficiently large temperature, there is
only a single solution. When decreasing T, however, a
second solution appears exactly at the point where the
LHS becomes tangential to the RHS. Thus, in addition
to Eq. (18), at the critical temperature we should have

$$ a = \frac{1}{x(1-x)}, \tag{19} $$

or, alternatively,

$$ x = \frac{1}{2}\left[ 1 \pm \sqrt{1 - \frac{4}{a}} \right]. \tag{20} $$

Note that the above solution exists only when a ≥ 4.
Plugging Eq. (20) into Eq. (18) we find

$$ b = \ln\frac{a \pm \tilde{a}}{a \mp \tilde{a}} - \frac{1}{2}(a \pm \tilde{a}), \qquad \tilde{a} = \sqrt{a^2 - 4a} \tag{21} $$

Thus, for any given a > 4, the rest point equation has
three solutions whenever $b^- < b < b^+$, where

$$ b^- = \ln\frac{a + \tilde{a}}{a - \tilde{a}} - \frac{a + \tilde{a}}{2}, \qquad b^+ = \ln\frac{a - \tilde{a}}{a + \tilde{a}} - \frac{a - \tilde{a}}{2} \tag{22} $$

For small values of T, when a is sufficiently large (and
positive), the two branches $b^-$ and $b^+$ are well separated.
When one increases T, however, at some critical
value those two branches meet and a cusp bifurcation
occurs [25]. The point where the two bifurcation curves
meet can be shown to be (a, b) = (4, −2), and is called
a cusp point. A saddle-node bifurcation occurs all along
the boundary of the region, except at the cusp point,
where one has a codimension-2 bifurcation, i.e., two parameters
have to be tuned for this type of bifurcation to
take place. This boundary in the parameter space is
shown in Fig. 2.
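The two branches that bound the three-rest-point region follow from combining the tangency condition a = 1/(x(1 − x)) with the rest point equation a x + b = ln(x/(1 − x)). The sketch below evaluates them and checks the location of the cusp point. The function names are hypothetical.

```python
import math


def critical_offsets(a):
    """Lower and upper critical values of b for a given a >= 4."""
    if a < 4.0:
        raise ValueError("tangency requires a >= 4")
    a_tilde = math.sqrt(a * a - 4.0 * a)
    log_ratio = math.log((a + a_tilde) / (a - a_tilde))
    b_minus = log_ratio - 0.5 * (a + a_tilde)
    b_plus = -log_ratio - 0.5 * (a - a_tilde)
    return b_minus, b_plus


def has_three_symmetric_rest_points(a, b):
    """True when (a, b) lies strictly inside the cusp-shaped region."""
    if a <= 4.0:
        return False
    b_minus, b_plus = critical_offsets(a)
    return b_minus < b < b_plus
```

At a = 4 both branches evaluate to −2, which reproduces the cusp point, and the pair (a, b) = (10, −4) lies inside the region while (5, −2) lies outside.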




FIG. 2: Demonstration of the cusp bifurcation in the space 
of parameters a and b for symmetric equilibria. 



B. General Case 



We now examine the most general case. We find it
useful to introduce variables $u = \ln\frac{x}{1-x}$, $v = \ln\frac{y}{1-y}$.
Then the interior rest point equations can be rewritten
as

$$ u = b + \frac{a}{1 + e^{-v}}, \qquad v = d + \frac{c}{1 + e^{-u}} \tag{23} $$

where a, b, c, and d have been already defined in
Eqs. (16), (17). Eliminating v we obtain

$$ \frac{1}{a} u - \frac{b}{a} = \left[ 1 + e^{-d} e^{-\frac{c}{1 + e^{-u}}} \right]^{-1} \equiv g(u) \tag{24} $$

The solutions of Eq. (24) are the rest point(s) of the dynamics.
Its graphical representation is shown in Fig. 3.

FIG. 3: (Color online) Graphical representation of the general
rest point equation for two different values of c: Intersections
represent rest points.

It is easy to see that 0 < g(u) < 1. Furthermore, we
have from Eq. (24)

$$ g'(u) = \frac{c\, g (1 - g)}{4 \cosh^2\frac{u}{2}} \tag{25} $$

Thus, g(u) is a monotonically increasing (decreasing)
function whenever c > 0 (c < 0).

Next, we classify the games according to the number 
of rest points they allow. Let us consider two cases: 

i) ac < 0: Note that in Eq. (24) the LHS is a monotonically
increasing (decreasing) function for a > 0 (a < 0).
As stated above, the RHS is also a monotonically increasing
(decreasing) function whenever c > 0 (c < 0). Consequently,
whenever a and c have different signs, i.e.,
ac < 0, one of the sides is a monotonically increasing
function while the other is monotonically decreasing;
thus, there can be only one interior rest point, which,
due to the dissipative nature of the dynamics, is globally
stable. An example of this class of games is Matching
Pennies, which will be discussed in Section IV.

ii) ac > 0: In this case it is possible to have one, two
or three interior rest points. For the sake of concreteness,
we focus on a > 0, c > 0, so that both the LHS and RHS
of Eq. (24) are monotonically increasing functions.



Recall that at the critical point when the second solution
appears, the LHS of Eq. (24) should be tangential to
g(u). Consider now the set of all tangential lines to g(u)
in Eq. (24), and let $\delta_{\min}$ and $\delta_{\max}$ be the minimum and
maximum value of the intercepts among those tangential
lines for any u and Ty. The intercept of the line given
by the LHS of Eq. (24), on the other hand, equals −b/a, and
is independent of the temperature. It is straightforward
to check that multiple rest points are possible only when
$\delta_{\min} < -b/a < \delta_{\max}$.

A full analysis along those lines (see Appendix A) reveals
that the number of possible rest points depends on
the ratios b/a and d/c, as depicted in Fig. 4. First, consider
the parameter range 0 < −b/a, −d/c < 1 (shaded light-grey
region in Fig. 4), which corresponds to so-called coordination
games that have three NE. The learning dynamics
in these games can have three rest points that
intuitively correspond to the perturbed NE. In particular,
those rest points will converge to the NE as the
exploration rates vanish. When a, c < 0, the parameter
range 0 < −b/a, −d/c < 1 corresponds to so-called anti-coordination
games. Those games also have three NE, so
the learning dynamics can have three rest points.

Let us now focus on the light grey (not-shaded) regions in
Fig. 4. The games in this parameter range have a single
NE. At the same time, the learning dynamics might still
have multiple rest points. Those additional rest points
exist only for a range of exploration rates, and disappear
when both exploration rates Tx, Ty are sufficiently low or
sufficiently high; see Appendix B for details. An example
of this type of game will be presented in Section IV.
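The general rest point equation can be solved numerically by scanning for sign changes of u − b − a g(u) and refining each bracket by bisection. The sketch below does this and returns the rest points in the original variables (x, y). The grid size, the bisection depth, the function names and the parameter values used in the checks are hand-picked assumptions and are not taken from the figures. The search window must satisfy u_max > |a| + |b| to contain all roots.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def g(u, c, d):
    """g(u) of the rest point equation: the value of y implied by u."""
    return sigmoid(d + c * sigmoid(u))


def find_rest_points(a, b, c, d, u_max=50.0, grid=10000):
    """Locate roots of b + a*g(u) - u = 0 and return the (x, y) pairs."""
    def residual(u):
        return b + a * g(u, c, d) - u

    step = 2.0 * u_max / grid
    points = []
    lo, f_lo = -u_max, residual(-u_max)
    for i in range(1, grid + 1):
        hi = -u_max + i * step
        f_hi = residual(hi)
        if (f_lo < 0.0) != (f_hi < 0.0):
            left, right, f_left = lo, hi, f_lo
            for _ in range(60):  # bisection on the bracketing interval
                mid = 0.5 * (left + right)
                f_mid = residual(mid)
                if (f_mid < 0.0) == (f_left < 0.0):
                    left, f_left = mid, f_mid
                else:
                    right = mid
            root = 0.5 * (left + right)
            points.append((sigmoid(root), g(root, c, d)))
        lo, f_lo = hi, f_hi
    return points
```

With b/a = 0.1 and d/c = −0.8, a game with a single NE, the choice (a, c) = (5, 40) yields three rest points while (a, c) = (5, 2) yields one, which illustrates how additional rest points can appear in an intermediate range of exploration rates.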







FIG. 4: (Color online) Characterization of different games in 
the parameter space with a, c > 0. Dark blue region cor- 
responds to games that can have only a single rest point, 
whereas the games in the light grey regions can have three 
rest-points. The shaded grey square corresponds to games 
that have three Nash equilibria. 

Note that Fig. 4 was obtained by assuming that
Tx and Ty are independent parameters. Assuming some
type of functional dependence between those two parameters
alters the above characterization. For instance, consider
the case Tx = Ty = T. At the critical point we
have (in addition to Eq. (24)) $a g'(u) = 1$, which yields

$$ ac = \frac{4 \cosh^2\frac{u}{2}}{g(1 - g)} \tag{26} $$

It can be shown^3 that when Tx = Ty = T the above
conditions can be met only when 0 < −b/a < 1, 0 <
−d/c < 1 (shaded region in Fig. 4), which corresponds to
the domain of multiple NE: coordination (a, c > 0) and
anti-coordination (a, c < 0) games.

It is illustrative to write Eq. (26) in terms of the original
variables x and y:

$$ ac = \frac{1}{x(1-x)\, y(1-y)} \tag{27} $$

It can be seen that Eq. (19) is recovered when a = c and
x = y. Furthermore, since 0 < x, y < 1, the above
condition can be satisfied only when ac ≥ 16.

a. Linear Stability Analysis. We conclude this section
by briefly elaborating on the dynamic stability of
the interior rest points. Note that, whenever there is a
single rest point, it will be globally stable due to the dissipative
nature of the dynamics. Thus, we focus on the
case when there are multiple rest points.

For the interior rest points, the eigenvalues of the Jacobian
of the dynamical system Eqs. (14), (15) are as follows:

$$ \lambda_{1,2} = -1 \pm \sqrt{ac\, x(1-x)\, y(1-y)} \tag{28} $$

Let us focus on symmetric games and symmetric equilibria
(i.e., x = y). From Eq. (28) we find the eigenvalues
$\lambda_{1,2} = -1 \pm a x_0 (1 - x_0)$, so that the stability condition
is $a x_0 (1 - x_0) < 1$. Recalling that at the critical point
we have $a = \frac{1}{x_0(1 - x_0)}$, it is straightforward to demonstrate
that for the middle rest point the above condition
is always violated, meaning that it is always unstable.
Similar reasoning shows that the two other rest points are
locally stable, and depending on the starting point of the
learning trajectory, the system will converge to one of the
two points. An example of the flows generated by the dynamics
for below-critical and above-critical exploration
rates is depicted in Fig. 5(a) and 5(b).
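The eigenvalue formula lends itself to a short numerical check. The sketch below evaluates the two eigenvalues at a given point and tests the sign of their real parts. It assumes that (x, y) is an interior rest point of the two-action dynamics; the formula is not valid elsewhere. The function names, and the approximate rest point locations used in the checks (for a = c = 10, b = d = −4), are illustrative assumptions.

```python
import cmath


def jacobian_eigenvalues(a, c, x, y):
    """Eigenvalues -1 +/- sqrt(a c x (1 - x) y (1 - y)) at a rest point."""
    root = cmath.sqrt(a * c * x * (1.0 - x) * y * (1.0 - y))
    return -1.0 + root, -1.0 - root


def is_locally_stable(a, c, x, y):
    """A rest point is linearly stable when both real parts are negative."""
    return all(ev.real < 0.0 for ev in jacobian_eigenvalues(a, c, x, y))
```

Using complex square roots covers the case ac < 0 as well, where the eigenvalues form a complex pair with real part −1 and the rest point is a stable focus.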



^3 Indeed, substituting g(u) from Eq. (24) into Eq. (26) one formally
obtains a quadratic equation for T, $A T^2 + B T + C = 0$, with
$A = \frac{4\cosh^2(u/2)}{a'c'} + \frac{u^2}{a'^2}$, $B = -\left(1 + \frac{2b'}{a'}\right)\frac{u}{a'}$,
$C = \frac{b'}{a'} + \frac{b'^2}{a'^2}$, where $a' = a_{11} + a_{22} - a_{12} - a_{21}$,
$c' = b_{11} + b_{22} - b_{12} - b_{21}$, and $b' = a_{12} - a_{22}$.
Requiring that T is a real positive number yields 4AC < 0,
or 0 < −b/a < 1. With similar reasoning, the domain of d/c
with multiple intersections is 0 < −d/c < 1.







FIG. 5: (Color online) Illustration of dynamical flow for a 
system with three (a) and single (b) rest points. Note that 
the middle rest point in (a) is unstable. 



IV. EXAMPLES 

We now illustrate the above findings on several games
shown in Fig. 6. The row (column) number corresponds
to the actions of the first (second) agent. Each cell contains
a reward pair $(a_{ij}, b_{ji})$, where $a_{ij}$ and $b_{ji}$ are the
corresponding elements of the reward matrices A and B.



Prisoner's Dilemma
          C          D
   C    (3,3)      (0,4)
   D    (4,0)      (2,2)

Matching Pennies
          H          T
   H    (1,0)      (0,1)
   T    (0,1)      (1,0)

Coordination Game
          S          H
   S    (6,6)      (0,3)
   H    (3,0)      (2,2)

Hawk-Dove Game
          H          D
   H    (-3,-3)    (1,-1)
   D    (-1,1)     (0,0)


FIG. 6: Examples of reward matrices for typical two-action 
games. 

Our first example is the Prisoner's Dilemma (PD),
where each player should decide whether to Cooperate
(C) or Defect (D). An example of a PD payoff matrix
is shown in Fig. 6. In PD, defection is a dominant
strategy: it always yields a better reward regardless of
the other player's choice. Thus, even though it is beneficial
for the players to cooperate, the only Nash equilibrium of
the game is when both players defect. For Tx = Ty = 0,
the dynamics always converges to the NE.

In our PD example we have b/a = d/c = −2, so according
to Fig. 4 there is a single interior rest point for any
Tx, Ty > 0. Furthermore, due to the dissipative nature
of the dynamics, the system is guaranteed to converge
to this rest point for any finite exploration rates.
Note that this is in stark contrast to the behavior of
ε-greedy learning reported in [10], where the authors observed
that, starting from some initial conditions, the
dynamics might never converge, instead alternating between
different strategy regimes. The lack of convergence
and chaotic behavior in their case can be attributed to
the hybrid nature of the dynamics.

Next, we consider Matching Pennies (MP), which is
a zero-sum game where the first (second) player wins if
both players select the same (different) actions; see Fig. 6.
This game does not have any pure NE, but it has a mixed
NE at x* = y* = 1/2. This mixed NE is a rest point of
the learning dynamics at Tx = Ty = 0, which is a center
point surrounded by periodic orbits [21]. For this game
we have ac < 0. Thus, there can be only one interior rest
point, which is globally stable for any Tx, Ty > 0.
Furthermore, a particular feature of this game is that
finite Tx, Ty does not perturb the position of the rest
point (since the entropic term is zero for x = y = 1/2).

We now consider a coordination game (shaded area
in Fig. 4) where players have an incentive to select the
same action. In the example shown in Fig. 6, the players
should decide whether to hunt a stag (S) or a hare
(H). This game has two pure NE, (S,S) and (H,H), as
well as a mixed NE at (x*, y*) = (−d/c, −b/a), which, for
the particular coordination game shown in Fig. 6, yields
x* = y* = 2/5. For sufficiently small exploration rates,
the learning dynamics has three rest points that intuitively
correspond to the three NE of the game. Furthermore,
the rest points corresponding to the pure equilibria
are stable, while the one corresponding to the mixed
equilibrium is unstable.

When increasing the exploration rates, there is a critical
line $(T_x^c, T_y^c)$ so that for any $T_x > T_x^c$, $T_y > T_y^c$
only one of the rest points survives. In Fig. 7 we show
the bifurcation diagram for Tx = Ty.^4 We find
that most coordination games are characterized by a discontinuous
pitchfork bifurcation (see Fig. 7(a)), where
above the critical line the surviving rest point corresponds
to the risk-dominant NE.^5 There is an exception, however,
for games with b/a + d/c = −1. This condition describes
games where none of the pure NE are strictly risk
dominant, and where the mixed NE satisfies x* + y* = 1.
The rest point structure undergoes a continuous pitchfork
bifurcation, as shown in Fig. 7(b), whenever a = c and
b/a + d/c = −1. One can show that when the above condition
is met, the critical point $u_0$ that satisfies $g'(u_0) = \frac{1}{a}$,
$\frac{1}{a} u_0 - \frac{b}{a} = g(u_0)$, is also the inflection point of g(u),
$g''(u_0) = 0$.

FIG. 7: (Color online) Bifurcation diagram of the rest points
for Tx = Ty = T: (a) Disconnected pitchfork, with mixed
NE: (x*, y*) = (2/5, 2/5). (b) Continuous pitchfork, with
mixed NE: (x*, y*) = (1/3, 2/3).

The other class of two-action games with multiple NE
are so-called anti-coordination games, where it is mutually
beneficial for the players to select different actions.
In anti-coordination games, one has a, c < 0, whereas
0 < −b/a < 1, 0 < −d/c < 1. A popular example is the
so-called Hawk-Dove game, where players should choose between
an aggressive (H) or peaceful (D) behavior. This
game has two pure NE, (H,D), (D,H), and a mixed NE
at (x*, y*) = (−d/c, −b/a). An example is shown in Fig. 6,
with a mixed NE at x* = y* = 1/3.
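The ratios b/a and d/c quoted for the example games can be recomputed from the cells of Fig. 6. The sketch below splits each cell $(a_{ij}, b_{ji})$ into the two payoff matrices and evaluates the temperature-independent ratios, with a and c taken proportional to $a_{11} + a_{22} - a_{12} - a_{21}$ and $b_{11} + b_{22} - b_{12} - b_{21}$ as follows from the two-action dynamics. The function and dictionary names are hypothetical.

```python
def matrices_from_cells(cells):
    """Split Fig. 6 style cells (a_ij, b_ji) into payoff matrices A and B."""
    n = len(cells)
    payoff_a = [[cells[i][j][0] for j in range(n)] for i in range(n)]
    # b_ij is stored as the second entry of cell (j, i)
    payoff_b = [[cells[j][i][1] for j in range(n)] for i in range(n)]
    return payoff_a, payoff_b


def payoff_ratios(cells):
    """Return (b/a, d/c, sign-carrying numerators of a and c)."""
    A, B = matrices_from_cells(cells)
    a_num = A[0][0] + A[1][1] - A[0][1] - A[1][0]
    c_num = B[0][0] + B[1][1] - B[0][1] - B[1][0]
    b_over_a = (A[0][1] - A[1][1]) / a_num
    d_over_c = (B[0][1] - B[1][1]) / c_num
    return b_over_a, d_over_c, a_num, c_num


GAMES = {
    "prisoners_dilemma": [[(3, 3), (0, 4)], [(4, 0), (2, 2)]],
    "matching_pennies": [[(1, 0), (0, 1)], [(0, 1), (1, 0)]],
    "coordination": [[(6, 6), (0, 3)], [(3, 0), (2, 2)]],
    "hawk_dove": [[(-3, -3), (1, -1)], [(-1, 1), (0, 0)]],
}
```

Calling `payoff_ratios(GAMES["coordination"])` gives b/a = d/c = −0.4, so the mixed NE (−d/c, −b/a) sits at (2/5, 2/5), and the Hawk-Dove cells give a, c < 0 with −b/a = −d/c = 1/3.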



^4 We find that the bifurcation structure is qualitatively similar for
the more general case Tx ≠ Ty.

^5 In a general coordination game, the strategy profile (1,1) is risk
dominant if $(a_{12} - a_{22})(b_{12} - b_{22}) < (a_{21} - a_{11})(b_{21} - b_{11})$.
In symmetric coordination games (i.e., as shown in Fig. 6), the
strategy profile is risk-dominant if it yields a better payoff against
an opponent that plays a uniformly mixed strategy.
FIG. 8: (Color online) Bifurcation in the domain of the games
with a, c > 0, b/a > 0, −1/2 > d/c > −1. In this example we have
d/c = −0.8, b/a = 0.1: (a) Rest point structure plotted against
Tx for $T_y < T_y^c$ and $T_y > T_y^c$. (b) The rest point structure
plotted against Ty for $T_x^-(T_y) < T_x < T_x^+(T_y)$. In both
graphs, the red dot-dashed lines correspond to the unstable
rest points.


Anti-coordination games have a bifurcation
structure similar to that of the coordination games. Namely,
there is a critical line $(T_x^c, T_y^c)$ so that for any $T_x > T_x^c$,
$T_y > T_y^c$ only a single rest point survives. As in
the coordination games, the bifurcation is discontinuous
for most parameter values. The condition for a continuous
pitchfork bifurcation in the anti-coordination games is
given by a = c and b/a = d/c. Thus, those games have a
symmetric NE, x* = y*. Furthermore, the critical point
where the second solution appears is also the inflection
point of g(u), $g''(u_0) = 0$.

Finally, let us consider the games with a single NE, for which
the learning dynamics can still have multiple rest points. To be
specific, we focus on the case a, c > 0, for which the possible
regimes are outlined in Fig. 21. In Fig. 8(a), we show the
dependence of the rest point structure on the parameter T_X, for
two different values of T_Y, for b/a = 0.1, d/c = -0.8. It can be
seen that for sufficiently small T_X, the learning dynamics allows
a single rest point (that corresponds to the NE of the game).
Similarly, there is a single rest point whenever T_Y is
sufficiently high. However, there is a critical exploration rate
for agent Y, T_Y^c, so that for any 0 < T_Y < T_Y^c, there is a
range T_X^-(T_Y) < T_X < T_X^+(T_Y) for which the dynamics allows
three rest points. In contrast to the coordination and
anti-coordination games considered above, those additional rest
points do not correspond to any NE of the game. In particular,
they disappear when T_X, T_Y are sufficiently small. We elaborate
more on the appearance of those rest points in Appendix B.
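
The change from one to three rest points can be checked numerically. The sketch below is illustrative only: it assumes that the rest-point condition can be written as (u - b)/a = g(u) with u = ln(x/(1-x)) and g(u) = 1/(1 + exp(-d - c/(1 + e^{-u}))), and that a, b scale as 1/T_X while c, d scale as 1/T_Y. The function names and the rescaled payoff values (a_raw = 1, b_raw = 0.1, c_raw = 1, d_raw = -0.8, chosen so that b/a = 0.1 and d/c = -0.8) are hypothetical and are not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    """Logistic function written with tanh so that large |z| cannot overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def count_rest_points(a_raw, b_raw, c_raw, d_raw, T_x, T_y, n_grid=200001):
    """Count solutions of (u - b)/a = g(u) by sign changes on a grid.

    Assumptions of this sketch: a = a_raw/T_x, b = b_raw/T_x, c = c_raw/T_y,
    d = d_raw/T_y with a_raw > 0, and g(u) = sigmoid(d + c * sigmoid(u)).
    """
    a, b = a_raw / T_x, b_raw / T_x
    c, d = c_raw / T_y, d_raw / T_y
    # g takes values in (0, 1), so every root satisfies b < u < a + b.
    u = np.linspace(b - 1.0, a + b + 1.0, n_grid)
    h = (u - b) / a - sigmoid(d + c * sigmoid(u))
    return int(np.count_nonzero(np.sign(h[:-1]) != np.sign(h[1:])))


if __name__ == "__main__":
    # Scan T_x at a small, fixed T_y (hypothetical values).
    for T_x in (0.02, 0.3, 2.0):
        print(T_x, count_rest_points(1.0, 0.1, 1.0, -0.8, T_x, T_y=0.02))
```

Counting sign changes on a grid is a deliberately simple choice; it can miss roots that lie closer together than the grid spacing, so a production analysis would refine each bracket with a root finder.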

Fig. 8(b) shows the bifurcation diagram for the same game but
plotted against T_Y. Note that the two diagrams are asymmetric.
In particular, in contrast to Fig. 8(a), here multiple solutions
are possible even when T_Y is arbitrarily small (provided that
T_X^-(T_Y) < T_X < T_X^+(T_Y)). This asymmetry is due to the fact
that the agents' payoff matrices represent different games. In
this particular case, the first player's payoff matrix corresponds
to a dominant action game, whereas the second player's payoff
matrix corresponds to a coordination game. Clearly, when T_X is
very small, the first player will mostly select the dominant
action, so there can be only a single rest point at small T_X.
Increasing T_X will make the entropic term more important, until
at a certain point, multiple rest points will emerge.

The same picture is preserved for the parameter range
b/a < -1, -0.5 < d/c < 0 (the other light grey horizontal stripe).
On the other hand, the players effectively exchange roles in the
parameter ranges corresponding to the vertical stripes:
d/c > 0, -1 < b/a < -1/2 and d/c < -1, -1/2 < b/a < 0. In this
case, there is a critical exploration rate T_X^c, so that for any
0 < T_X < T_X^c, there is a range T_Y^-(T_X) < T_Y < T_Y^+(T_X)
for which the dynamics allows three rest points.

Finally, we note that the rest point behavior is different in
the light grey regions where the parameters are confined to
b/a > 0, d/c < -1 and b/a < -1, d/c > 0. In those regions,
multiple rest points are available only when both T_X and T_Y are
strictly positive, i.e., T_X^- > 0, T_Y^- > 0.



V. DISCUSSION 

We have presented a comprehensive analysis of two-agent
Q-learning dynamics with a Boltzmann action selection mechanism,
where the agents' exploration rates are governed by the
temperatures T_X, T_Y. For any two-action game at finite
exploration rate the dynamics is dissipative and thus guaranteed
to reach a rest point asymptotically. We demonstrated that,
depending on the game and the exploration rates, the rest point
structure of the learning dynamics is different. When T_X = T_Y,
for games with a single NE (either pure or mixed) there is a
single globally stable rest point for any positive exploration
rate. Furthermore, we analytically examined the impact of
exploration/noise on the asymptotic behavior, and showed that in
games with multiple NE the rest point structure undergoes a
bifurcation so that above a critical exploration rate only one
globally stable solution persists. Previously, a similar behavior
for certain games was observed numerically in Ref. [26], where the
authors studied Quantal Response Equilibrium (QRE) among agents
with bounded rationality. In fact, it can be shown that QRE
corresponds to the rest point of the Boltzmann Q-learning
dynamics. A similar bifurcation picture was also demonstrated for
certain continuous action games [27].

In general, we observed that for T_X ≠ T_Y, the learning
dynamics is qualitatively similar for games with multiple NE.
Namely, there is a bifurcation at critical exploration rates
T_X^c and T_Y^c, so that the learning dynamics allows three rest
points below those critical values and a single rest point above
them. In particular, the rest points converge to the NE of the
game when T_X, T_Y → 0. What is perhaps more interesting is that
for certain games with a single NE, it is possible to have
multiple rest points in the learning dynamics when T_X ≠ T_Y.
Those additional rest points persist only for a finite range of
exploration rates, and disappear when the exploration rates T_X
and T_Y tend to zero.
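
The convergence to rest points, and the change in their number with the exploration rate, can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: it takes the two-action dynamics to have the form dx/dt = x(1-x)[a_raw*y + b_raw - T_X ln(x/(1-x))] and the analogous equation for y, which is consistent with the rest-point condition a y + b = ln(x/(1-x)) but is not quoted from this section. The symmetric coordination game (a_raw = c_raw = 2, b_raw = d_raw = -1), the forward Euler scheme, and the step size are hypothetical choices. For these values, linearizing the assumed equations around x = y = 1/2 places the bifurcation at T_X = T_Y = 1/2.

```python
import math


def simulate(a_raw, b_raw, c_raw, d_raw, T_x, T_y, x0, y0,
             dt=0.01, n_steps=20000):
    """Forward-Euler integration of the assumed two-action learning dynamics.

    x and y are the probabilities that each agent plays its first action.
    """
    eps = 1e-12  # keep the strategies strictly inside (0, 1)
    x, y = x0, y0
    for _ in range(n_steps):
        dx = x * (1.0 - x) * (a_raw * y + b_raw - T_x * math.log(x / (1.0 - x)))
        dy = y * (1.0 - y) * (c_raw * x + d_raw - T_y * math.log(y / (1.0 - y)))
        x = min(max(x + dt * dx, eps), 1.0 - eps)
        y = min(max(y + dt * dy, eps), 1.0 - eps)
    return x, y


if __name__ == "__main__":
    game = (2.0, -1.0, 2.0, -1.0)  # hypothetical symmetric coordination game
    for T in (1.0, 0.25):
        print(T, simulate(*game, T, T, 0.9, 0.8), simulate(*game, T, T, 0.1, 0.2))
```

Above the critical temperature both initial conditions should end at the same interior point, while below it they should settle near different corners of the strategy space.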

We suggest that the sensitivity of the learning dynamics to the
exploration rate can be useful for validating various hypotheses
about possible learning mechanisms in experiments. Indeed, most
empirical studies so far have been limited to games with a single
equilibrium, such as matching pennies, where the dynamics is
rather insensitive to the exploration rate. We believe that for
different games (such as coordination or anti-coordination games),
the fine-grained nature of the rest point structure, and
specifically, its sensitivity to the exploration rate, can
provide much richer information about the learning mechanisms
employed by the agents.

Note Added: After completing the manuscript, we became aware of
a very recent work reporting similar results [23], which studies
convergence properties and bifurcation in the solution structure
using local stability analysis. For games with a single rest
point, such as Prisoner's Dilemma, local stability is subsumed by
the global stability demonstrated here. The bifurcation results
are similar, even though Ref. [23] studies only coordination games
and does not differentiate between continuous and discontinuous
pitchfork bifurcation. Finally, the analytical form of the phase
diagram, Eq. (55), for the symmetric case, as well as the
possibility of multiple rest points for games with a single NE
demonstrated here, are complementary to the results presented in
Ref. [23].

Appendix A: Classification of games according to
the number of allowable rest-points

Here we derive the conditions for multiple rest-points.
We assume a, c > 0 for the sake of concreteness.

Consider the set of all the tangent lines to g(u) (see
Eq. (24)), and let δ_{T_Y}(u) be the intercept of the tangent
line that passes through point u, δ_{T_Y}(u) = g(u) - g'(u)u.
Here the subscript indicates that the intercept depends on the
exploration rate T_Y via the coefficients c and d. The extremum
of the function δ_{T_Y}(u) occurs at δ'_{T_Y}(u) = -g''(u)u = 0,
where

    g''(u) = -\frac{c\, g(1-g)}{16\cosh^4(u/2)}
             \left[ c\tanh\left(\frac{d}{2} + \frac{c/2}{1+e^{-u}}\right) + 2\sinh u \right]    (A1)


Let u_0 be the point where g''(u_0) = 0. A simple analysis
yields that u_0 > 0 whenever d < -c/2, and u_0 < 0 otherwise.
Next, let δ_min = min_{u,T_Y} δ_{T_Y}(u) and δ_max =
max_{u,T_Y} δ_{T_Y}(u), where the minimization and maximization
are over both u and T_Y. It can be shown that there can be
multiple solutions only when δ_min < -b/a < δ_max.
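
The tangent-intercept construction can be checked numerically. The sketch below assumes g(u) = 1/(1 + exp(-d - c/(1 + e^{-u}))) and uses the closed form for g'' obtained by differentiating this g twice; both are assumptions of the example and are verified against finite differences rather than taken on trust. The helper names and the sample values c = 4, d = -3.2 (so that d/c = -0.8) are hypothetical.

```python
import numpy as np


def x_of_u(u):
    """Strategy x as a function of u = ln(x / (1 - x))."""
    return 1.0 / (1.0 + np.exp(-u))


def g(u, c, d):
    """Assumed right-hand side of the rest-point equation."""
    return 1.0 / (1.0 + np.exp(-(d + c * x_of_u(u))))


def g_prime(u, c, d):
    """First derivative of g, from the chain rule."""
    x, gv = x_of_u(u), g(u, c, d)
    return c * gv * (1.0 - gv) * x * (1.0 - x)


def bracket(u, c, d):
    """Factor whose zero marks the inflection point of g (increasing in u for c > 0)."""
    return c * np.tanh(0.5 * d + 0.5 * c * x_of_u(u)) + 2.0 * np.sinh(u)


def g_second(u, c, d):
    """Closed-form second derivative of g."""
    gv = g(u, c, d)
    return -c * gv * (1.0 - gv) * bracket(u, c, d) / (16.0 * np.cosh(0.5 * u) ** 4)


def tangent_intercept(u, c, d):
    """Intercept delta(u) = g(u) - g'(u) u of the tangent line to g at u."""
    return g(u, c, d) - g_prime(u, c, d) * u


def inflection_point(c, d, u_max=10.0, n_grid=200001):
    """Grid estimate of u_0 with g''(u_0) = 0; assumes c > 0 and |u_0| < u_max."""
    u = np.linspace(-u_max, u_max, n_grid)
    return float(u[np.searchsorted(bracket(u, c, d), 0.0)])


if __name__ == "__main__":
    print(inflection_point(4.0, -3.2), inflection_point(4.0, -1.2))
```

With c = 4, the first call has d < -c/2 and the second has d > -c/2, so the two estimates should fall on opposite sides of u = 0.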

We now consider different possibilities depending on the ratio
d/c. Due to symmetry, it is sufficient to consider d/c < -1/2.
We differentiate the following cases:

i) -1 < d/c < -1/2: In this case one has δ_min = -∞,
δ_max = 1. Thus, there will be one rest point whenever
b/a < -1.

ii) d/c < -1: In this case one has δ_max = 1/2; thus, there
will be a single rest point whenever b/a < -1/2. Furthermore,
although an analytical expression for δ_min is not available,
the corresponding boundary can be found by numerically solving
the transcendental equation -b/a = δ_min for different d/c.

Repeating the same reasoning for d/c > -1/2 yields the
different regions depicted in Fig. S) 



Appendix B: Appearance of multiple rest points in 
games with single NE 

We now elaborate on games with a single NE for which the
learning dynamics can still have multiple rest points. For the
sake of concreteness, let us consider one of the regions in
Fig. H] that corresponds to b/a > 0, -1 < d/c < -1/2. The
graphical representation of the rest point equation is shown in
Fig. 9. For a given T_Y, the two lines correspond to the critical
values T_X^-(T_Y) and T_X^+(T_Y). Let us consider the case
T_Y = 0. It is easy to see that in this limit g(u) becomes a step
function, g(u) = θ(u - ũ), where ũ is found by requiring
d/c + 1/(1 + e^{-ũ}) = 0, which yields ũ = ln[-d/(c + d)].
Simple calculations yield T_X^-(T_Y = 0) = (ã/ũ)(b/a) and
T_X^+(T_Y = 0) = (ã/ũ)(b/a + 1), where
ã = a_21 + a_12 - a_11 - a_22 = a T_X. For general T_Y > 0, the
corresponding values T_X^-(T_Y) and T_X^+(T_Y) can be found
numerically. Finally, note that when increasing T_Y, there is a
critical exploration rate T_Y = T_Y^c so that for T_Y > T_Y^c the
multiple solutions disappear. It is easy to see that T_Y^c
corresponds to the point where the minimum value of the intercept
of the tangent lines to g(u) for a given T_Y equals -b/a.

FIG. 9: Graphical illustration of the multi-rest point equation
for a game with a single NE. Here a, c > 0, b/a > 0, and
-1 < d/c < -1/2.
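
The critical rate T_Y^c can be estimated numerically from the tangent-intercept criterion. The sketch below is an illustration under assumptions, not the authors' procedure: it takes g(u) = 1/(1 + exp(-d - c/(1 + e^{-u}))) with c = c_raw/T_Y and d = d_raw/T_Y, and uses the fact that for b/a > 0 the target value -b/a is negative, so the relevant extremum is the most negative tangent intercept. The function names, the grid, the bisection bracket, and the sample values (c_raw = 1, d_raw = -0.8, b/a = 0.1) are hypothetical.

```python
import numpy as np

# Fixed grid in u = ln(x / (1 - x)); wide enough for the hypothetical values below.
U_GRID = np.linspace(-30.0, 30.0, 60001)


def min_tangent_intercept(c_raw, d_raw, T_y, u=U_GRID):
    """Smallest tangent intercept g(u) - g'(u) u over the grid at a given T_y."""
    c, d = c_raw / T_y, d_raw / T_y
    x = 0.5 * (1.0 + np.tanh(0.5 * u))              # overflow-safe logistic
    g = 0.5 * (1.0 + np.tanh(0.5 * (d + c * x)))
    g_prime = c * g * (1.0 - g) * x * (1.0 - x)
    return float(np.min(g - g_prime * u))


def critical_T_y(c_raw, d_raw, b_over_a, T_lo=1e-3, T_hi=10.0, n_iter=60):
    """Bisection for the T_y at which the smallest intercept equals -b/a."""
    def f(T_y):
        return min_tangent_intercept(c_raw, d_raw, T_y) + b_over_a

    if not (f(T_lo) < 0.0 < f(T_hi)):
        raise ValueError("bracket does not enclose the critical temperature")
    for _ in range(n_iter):
        T_mid = 0.5 * (T_lo + T_hi)
        if f(T_mid) < 0.0:
            T_lo = T_mid
        else:
            T_hi = T_mid
    return 0.5 * (T_lo + T_hi)


if __name__ == "__main__":
    print(critical_T_y(1.0, -0.8, 0.1))
```

Bisection is used here because it only needs a sign change between the two ends of the bracket; the value returned depends on the grid resolution and should be read as an estimate.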
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